Skip to content

Conversation

janhoy
Copy link
Contributor

@janhoy janhoy commented Sep 19, 2025

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on the one in #3361 but instead of making a new module, we add it as a capability to the existing extraction handler through specifying extraction.backend=tikaserver.

This first required refactoring extraction handler to detach it from the Tika-v1 API. There is a new interface ExtractionBackend that takes generic ExtractionRequest object in and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tikav1 api handling. This implementation can be deprecated, and in Solr 10, the tikaserver one can be made default.

All existing tests pass, and most of the existing extraction tests now also pass when running the tikaserver backend (running in TestContainers). Unfortunately docker is not available in Crave, so a new GH workflow is made to run only the extraction tests.

TODO's:

  • Lots of code from GenAI, which needs review and rewrite / simplification.
  • There may be debug print and TODOs left here and there
  • The metadata back-compat map is AI generated and by no means complete or even correct 🤣
  • Lack of JavaDoc everywhere
  • Harden exception handling, retrying, timeout values etc
  • Should probably use Jetty HTTP client instead of JDK one.
  • Optimize throughput with async methods? The request thread is blocked on Tika response
  • Explicit HTTPS / self-signed cert support?
  • Complete the RefGuide docs

@janhoy janhoy marked this pull request as draft September 19, 2025 15:14
@janhoy janhoy requested a review from epugh September 19, 2025 15:14
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 19, 2025
@epugh
Copy link
Contributor

epugh commented Sep 19, 2025

Exciting!

@janhoy
Copy link
Contributor Author

janhoy commented Sep 20, 2025

Status:

  • Parses docs using TikaServer
  • Can switch between xml (html) and text format of the content field
  • Randomized the choice of backend for the main test class
  • ExtractOnly not fully implemented for tikaserver, some tests fail

TBD:

  • The whole xpath / SAX parsing of XML response is missing
  • We use JDK HTTP client, could perhaps use Jetty client. See other POC for example, including making timeouts configurable
  • Must make sure that tikaserver.url is only configurable on requesthandler config in solrconfig, not as a request parameter (security)
  • RefGuide docs, especially how to start TikaServer etc
  • Remove the DummyExtractionBackend

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?

Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
@janhoy janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce Compare September 20, 2025 01:24
@epugh
Copy link
Contributor

epugh commented Sep 21, 2025

Any luck with security manager?? I had many difficulties

@epugh
Copy link
Contributor

epugh commented Sep 22, 2025

Testcontainers and docker don't love the SecurityManager. I had claude try to run the tests and add additional permissions to solr-tests.policy, and after an hour or so, I had a lot more, but no love:

// Needed for testcontainers
  permission java.io.FilePermission "/Users/epugh/.testcontainers.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker-java.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker/-", "read";
  permission java.io.FilePermission "/usr/local/opt/go@1.22/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/opt/go@1.20/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/installs/nodejs/20.18.3/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/shims/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.nvm/versions/node/v14.21.2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.rbenv/shims/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/sbin/docker-machine", "read";
  permission java.io.FilePermission "/System/Cryptexes/App/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/sbin/docker-machine", "read";
  permission java.io.FilePermission "/sbin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/MacGPG2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Library/Apple/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin/docker-machine", "read";


@janhoy
Copy link
Contributor Author

janhoy commented Sep 22, 2025

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

@iamsanjay
Copy link
Contributor

I had the similar experience as I was upgrading kafka. And then I stopped.

Java Security Manager and Testcontainers do not play nicely together.  We prefer Testcontainers, so disable JSM
@epugh
Copy link
Contributor

epugh commented Sep 22, 2025

When I first saw DummyExtractionBackend my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your set up in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"....

Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
* @deprecated Will be replaced with something similar that calls out to a separate Tika Server
* process running in its own JVM.
*/
@Deprecated(since = "9.10.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before 9.10 release. Also perhaps wording in major-changes...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Dependency upgrades documentation Improvements or additions to documentation module:extraction test-framework tests tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants